{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "95f0a171",
   "metadata": {},
   "source": [
    "(communicate-plots)=\n",
    "# Graphics for Communication\n",
    "\n",
    "## Introduction\n",
    "\n",
    "In this chapter, you'll learn about using visualisation to communicate.\n",
    "\n",
    "In {ref}`exploratory-data-analysis`, you learned how to use plots as tools for *exploration*.\n",
    "When you make exploratory plots, you know—even before looking—which variables the plot will display.\n",
    "You made each plot for a purpose, quickly looked at it, and then moved on to the next plot.\n",
    "In the course of most analyses, you'll produce tens or hundreds of plots, most of which are immediately thrown away.\n",
    "\n",
    "Now that you understand your data, you need to *communicate* your understanding to others.\n",
    "Your audience will likely not share your background knowledge and will not be deeply invested in the data. To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. In this chapter, you'll learn some of the tools that **lets-plot** provides to make charts tell a story.\n",
    "\n",
    "### Prerequisites\n",
    "\n",
    "As ever, there are a plethora of options (and packages) for data visualisation using code. We're focusing on the declarative, \"grammar of graphics\" approach using **lets-plot** here, but advanced users looking for more complex graphics might wish to use an imperative library such as the excellent **matplotlib**. You should have both **lets-plot** and **pandas** installed. Once you have them installed, import them like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51a55374",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "# remove cell\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib_inline.backend_inline\n",
    "\n",
    "# Plot settings\n",
    "plt.style.use(\"https://github.com/aeturrell/python4DS/raw/main/plot_style.txt\")\n",
    "matplotlib_inline.backend_inline.set_matplotlib_formats(\"svg\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae4a818a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from lets_plot import *\n",
    "\n",
    "LetsPlot.setup_html()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0dc9c10",
   "metadata": {},
   "source": [
    "## Labels, titles, and other contextual information\n",
    "\n",
    "The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. Let's look at an example using the MPG (miles per gallon) data, which covers the fuel economy for 38 popular models of cars from 1999 to 2008."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c36b4cd5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the data\n",
    "mpg = pd.read_csv(\n",
    "    \"https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/mpg.csv\", index_col=0\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1813ab08",
   "metadata": {},
   "source": [
    "We want to show how fuel efficiency on the highway changes with engine displacement, measured in litres. The most basic chart we can make with these variables is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c7574bc6",
   "metadata": {},
   "outputs": [],
   "source": [
    "(ggplot(mpg, aes(x=\"displ\", y=\"hwy\")) + geom_point())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff5ed0d4",
   "metadata": {},
   "source": [
    "Now we're going to add lots of extra useful information that will make the chart better. The purpose of a plot title is to summarize the main finding.\n",
    "Avoid titles that just describe what the plot is, e.g., \"A scatterplot of engine displacement vs. fuel economy\".\n",
    "\n",
    "We're going to:\n",
    "\n",
    "- add a title that summarises the main finding you'd like the viewer to take away (as opposed to one just describing the obvious!)\n",
    "- add a subtitle that provides more info on the y-axis, and make the x-label more understandable\n",
    "- remove the y-axis label that is at an awkward viewing angle\n",
    "- add a caption with the source of the data\n",
    "\n",
    "Putting this all in, we get:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24b3513e",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(mpg, aes(x=\"displ\", y=\"hwy\"))\n",
    "    + geom_point(aes(colour=\"class\"))\n",
    "    + geom_smooth(se=False, method=\"loess\", size=1)\n",
    "    + labs(\n",
    "        title=\"Fuel efficiency generally decreases with engine size\",\n",
    "        subtitle=\"Highway fuel efficiency (miles per gallon)\",\n",
    "        caption=\"Source: fueleconomy.gov\",\n",
    "        y=\"\",\n",
    "        x=\"Engine displacement (litres)\",\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e28877a",
   "metadata": {},
   "source": [
    "This is much clearer. It's easier to read, we know where the data come from, and we can see *why* we're being shown it too.\n",
    "\n",
    "But maybe we want a different message? You can adapt the labels to your needs. Some people prefer to keep the rotated y-axis label so that the subtitle is free to provide even more context:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6489a6bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(mpg, aes(x=\"displ\", y=\"hwy\"))\n",
    "    + geom_point(aes(colour=\"class\"))\n",
    "    + geom_smooth(se=False, method=\"loess\", size=1)\n",
    "    + labs(\n",
    "        x=\"Engine displacement (L)\",\n",
    "        y=\"Highway fuel economy (mpg)\",\n",
    "        colour=\"Car type\",\n",
    "        title=\"Fuel efficiency generally decreases with engine size\",\n",
    "        subtitle=\"Two seaters (sports cars) are an exception because of their light weight\",\n",
    "        caption=\"Source: fueleconomy.gov\",\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9d88f188",
   "metadata": {},
   "source": [
    "### Exercises\n",
    "\n",
    "1.  Create one plot on the fuel economy data with customized `title`, `subtitle`, `caption`, `x`, `y`, and `color` labels.\n",
    "\n",
    "2.  Recreate the following plot using the fuel economy data.\n",
    "    Note that both the colours and shapes of points vary by type of drive train."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "683d547c",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(mpg, aes(x=\"cty\", y=\"hwy\", color=\"drv\", shape=\"drv\"))\n",
    "    + geom_point()\n",
    "    + labs(\n",
    "        x=\"City MPG\",\n",
    "        y=\"Highway MPG\",\n",
    "        shape=\"Type of\\ndrive train\",\n",
    "        color=\"Type of\\ndrive train\",\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e10cdbd9",
   "metadata": {},
   "source": [
    "3.  Take an exploratory graphic that you've created in the last month, and add informative titles to make it easier for others to understand."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "289418c9",
   "metadata": {},
   "source": [
    "## Annotations\n",
    "\n",
    "In addition to labelling major components of your plot, it's often useful to label individual observations or groups of observations.\n",
    "The first tool you have at your disposal is `geom_text()`.\n",
    "`geom_text()` is similar to `geom_point()`, but it has an additional aesthetic: `label`.\n",
    "This makes it possible to add textual labels to your plots.\n",
    "\n",
    "There are two possible sources of labels: ones that are part of the data, which we'll add with `geom_text()`; and ones that we add directly and manually as annotations using `geom_label()`.\n",
    "\n",
    "In the first case, you might have a data frame that contains labels.\n",
    "In the following plot we compute the mean engine size and mean highway fuel economy for each drive type and save them in a new data frame called `label_info`. We use the mean values of \"displ\" and \"hwy\" by \"drv\" as the positions of the labels, but any aggregation that places the labels sensibly on the chart would work."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60826a32",
   "metadata": {},
   "outputs": [],
   "source": [
    "mapping = {\n",
    "    \"4\": \"4-wheel drive\",\n",
    "    \"f\": \"front-wheel drive\",\n",
    "    \"r\": \"rear-wheel drive\",\n",
    "}\n",
    "label_info = (\n",
    "    mpg.groupby(\"drv\")\n",
    "    .agg({\"hwy\": \"mean\", \"displ\": \"mean\"})\n",
    "    .reset_index()\n",
    "    .assign(drive_type=lambda x: x[\"drv\"].map(mapping))\n",
    "    .round(2)\n",
    ")\n",
    "label_info"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93a136fe",
   "metadata": {},
   "source": [
    "Then, we use this new data frame to label the three groups directly on the plot, which replaces the legend. Using the `fontface` and `size` arguments, we can customise the look of the text labels: they’re larger than the rest of the text on the plot, and bold. (`theme(legend_position=\"none\")` turns all the legends off; we’ll talk about it more shortly.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6f90c2aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(mpg, aes(x=\"displ\", y=\"hwy\", color=\"drv\"))\n",
    "    + geom_point(alpha=0.5)\n",
    "    + geom_smooth(se=False, method=\"loess\")\n",
    "    + geom_text(\n",
    "        aes(x=\"displ\", y=\"hwy\", label=\"drive_type\"),\n",
    "        data=label_info,\n",
    "        fontface=\"bold\",\n",
    "        size=8,\n",
    "        hjust=\"left\",\n",
    "        vjust=\"bottom\",\n",
    "    )\n",
    "    + theme(legend_position=\"none\")\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98c17829",
   "metadata": {},
   "source": [
    "Note the use of `hjust` (horizontal justification) and `vjust` (vertical justification) to control the alignment of the label.\n",
    "\n",
    "\n",
    "The second of the two methods we're looking at is `geom_label()`. This has two modes: in the first, it works like `geom_text()` but with a box around the text, like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bdcd79bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "potential_outliers = mpg.query(\"hwy > 40 | (hwy > 20 & displ > 5)\")\n",
    "(\n",
    "    ggplot(mpg, aes(x=\"displ\", y=\"hwy\"))\n",
    "    + geom_point(color=\"black\")\n",
    "    + geom_smooth(se=False, method=\"loess\", color=\"black\")\n",
    "    + geom_point(\n",
    "        data=potential_outliers,\n",
    "        color=\"red\",\n",
    "    )\n",
    "    + geom_label(\n",
    "        aes(label=\"model\"),\n",
    "        data=potential_outliers,\n",
    "        color=\"red\",\n",
    "        position=position_jitter(),\n",
    "        fontface=\"bold\",\n",
    "        size=5,\n",
    "        hjust=\"left\",\n",
    "        vjust=\"bottom\",\n",
    "    )\n",
    "    + theme(legend_position=\"none\")\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38b69dcf",
   "metadata": {},
   "source": [
    "In its second mode, `geom_label()` takes a fixed position and text directly, which is useful for adding one or several annotations to a plot, like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d1e2cc3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import textwrap\n",
    "\n",
    "# wrap the text so it is over multiple lines:\n",
    "trend_text = textwrap.fill(\"Larger engine sizes tend to have lower fuel economy.\", 30)\n",
    "trend_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8c09f57",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(mpg, aes(x=\"displ\", y=\"hwy\"))\n",
    "    + geom_point()\n",
    "    + geom_label(x=3.5, y=38, label=trend_text, hjust=\"left\", color=\"red\")\n",
    "    + geom_segment(x=2, y=40, xend=5, yend=25, arrow=arrow(type=\"closed\"), color=\"red\")\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0720e7eb",
   "metadata": {},
   "source": [
    "Annotation is a powerful tool for communicating main takeaways and interesting features of your visualisations. The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c00a0fd",
   "metadata": {},
   "source": [
    "Remember, in addition to `geom_text()` and `geom_label()`, you have many other geoms in **lets-plot** available to help annotate your plot.\n",
    "A few ideas:\n",
    "\n",
    "-   Use `geom_hline()` and `geom_vline()` to add reference lines.\n",
    "    We often make them thick (`size=2`) and grey (`color=\"gray\"`), and draw them underneath the primary data layer.\n",
    "    That makes them easy to see, without drawing attention away from the data.\n",
    "\n",
    "-   Use `geom_rect()` to draw a rectangle around points of interest.\n",
    "    The boundaries of the rectangle are defined by aesthetics `xmin`, `xmax`, `ymin`, `ymax`.\n",
    "\n",
    "-   You already saw the use of `geom_segment()` with the `arrow` argument to draw attention to a point with an arrow.\n",
    "    Use aesthetics `x` and `y` to define the starting location, and `xend` and `yend` to define the end location.\n"
   ]
  },
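  {
   "cell_type": "markdown",
   "id": "4c1a9e27",
   "metadata": {},
   "source": [
    "The first two ideas above can be combined in a minimal sketch like the one below. It assumes a small made-up dataset (`toy`) in place of `mpg`, and the intercept and rectangle bounds are arbitrary values chosen only for illustration:\n",
    "\n",
    "```python\n",
    "from lets_plot import aes, geom_hline, geom_point, geom_rect, ggplot\n",
    "\n",
    "# Made-up data standing in for the mpg columns used in this chapter\n",
    "toy = {\"displ\": [1.8, 2.4, 3.1, 4.0, 5.7, 6.2], \"hwy\": [35, 31, 27, 24, 25, 26]}\n",
    "\n",
    "p = (\n",
    "    ggplot(toy, aes(x=\"displ\", y=\"hwy\"))\n",
    "    # Reference line and highlight box come first so the points sit on top\n",
    "    + geom_hline(yintercept=30, size=2, color=\"gray\")\n",
    "    + geom_rect(xmin=5.5, xmax=6.4, ymin=24, ymax=27, fill=\"red\", alpha=0.2)\n",
    "    + geom_point()\n",
    ")\n",
    "p\n",
    "```\n",
    "\n",
    "Layers are drawn in the order they are added, which is why the annotation layers are listed before `geom_point()`."
   ]
  },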
  {
   "cell_type": "markdown",
   "id": "730162e6",
   "metadata": {},
   "source": [
    "### Exercises\n",
    "\n",
    "1.  Use `geom_text()` with infinite positions to place text at the four corners of the plot.\n",
    "\n",
    "2.  Use `geom_label()` to add a point geom in the middle of your last plot without having to create a data frame.\n",
    "    Customise the shape, size, or colour of the point.\n",
    "\n",
    "3.  How do labels with `geom_text()` interact with faceting?\n",
    "    How can you add a label to a single facet?\n",
    "    How can you put a different label in each facet?\n",
    "    (Hint: Think about the dataset that is being passed to `geom_text()`.)\n",
    "\n",
    "4.  What arguments to `geom_label()` control the appearance of the background box?\n",
    "\n",
    "5.  What are the four arguments to `arrow()`?\n",
    "    How do they work?\n",
    "    Create a series of plots that demonstrate the most important options.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f665492",
   "metadata": {},
   "source": [
    "## Scales\n",
    "\n",
    "Another way you can make your plot better for communication is to adjust the scales.\n",
    "Scales control how the aesthetic mappings manifest visually.\n",
    "\n",
    "### Default scales\n",
    "\n",
    "Normally, **lets-plot** automatically adds scales for you and you don't need to worry about them. For example, when you type:\n",
    "\n",
    "```python\n",
    "(\n",
    "    ggplot(mpg, aes(x=\"displ\", y=\"hwy\")) +\n",
    "    geom_point(aes(color=\"class\"))\n",
    ")\n",
    "```\n",
    "\n",
    "**lets-plot** is automatically doing this behind the scenes:\n",
    "\n",
    "```python\n",
    "(\n",
    "    ggplot(mpg, aes(x=\"displ\", y=\"hwy\")) +\n",
    "    geom_point(aes(color=\"class\")) +\n",
    "    scale_x_continuous() +\n",
    "    scale_y_continuous() +\n",
    "    scale_color_discrete()\n",
    ")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39332b3b",
   "metadata": {},
   "source": [
    "Note the naming scheme for scales: `scale_` followed by the name of the aesthetic, then `_`, then the name of the scale.\n",
    "The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date.\n",
    "`scale_x_continuous()` puts the numeric values from `displ` on a continuous number line on the x-axis, `scale_color_discrete()` chooses a colour for each `class` of car, etc.\n",
    "There are lots of non-default scales which you'll learn about below.\n",
    "\n",
    "The default scales have been carefully chosen to do a good job for a wide range of inputs.\n",
    "Nevertheless, you might want to override the defaults for two reasons:\n",
    "\n",
    "-   You might want to tweak some of the parameters of the default scale.\n",
    "    This allows you to do things like change the breaks on the axes, or the key labels on the legend.\n",
    "\n",
    "-   You might want to replace the scale altogether, and use a completely different algorithm.\n",
    "    Often you can do better than the default because you know more about the data.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c645247f",
   "metadata": {},
   "source": [
    "### Axis ticks and legend keys\n",
    "\n",
    "Collectively axes and legends get the somewhat confusing name **guides** in **lets-plot**. Axes are used for x and y aesthetics; legends are used for everything else.\n",
    "\n",
    "There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: `breaks` and `labels`.\n",
    "Breaks controls the position of the ticks, or the values associated with the keys. If you like, the breaks *are* the ticks.\n",
    "Labels controls the text label associated with each tick/key. We might more accurately call these *tick labels*.\n",
    "The most common use of `breaks` is to override the default choice:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a95604d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(mpg, aes(x=\"displ\", y=\"hwy\", color=\"drv\"))\n",
    "    + geom_point()\n",
    "    + scale_y_continuous(breaks=np.arange(15, 40, step=5))\n",
    ")"
   ]
  },
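  {
   "cell_type": "markdown",
   "id": "b7e41c09",
   "metadata": {},
   "source": [
    "One detail worth checking when you build breaks with `np.arange()` is that the stop value is excluded, so the call above asks for ticks at 15, 20, 25, 30, and 35 but not at 40. A minimal sketch (the variable names are ours, chosen for illustration):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# np.arange() excludes the stop value, so 40 is not among the breaks\n",
    "breaks = np.arange(15, 40, step=5)\n",
    "print(breaks.tolist())\n",
    "\n",
    "# To include 40 as a tick, move the stop just past it\n",
    "breaks_incl = np.arange(15, 41, step=5)\n",
    "print(breaks_incl.tolist())\n",
    "```\n",
    "\n",
    "Printing the breaks before passing them to a scale is a quick way to confirm where the ticks will fall."
   ]
  },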
  {
   "cell_type": "markdown",
   "id": "bd1113b7",
   "metadata": {},
   "source": [
    "You can use `labels` in the same way (i.e., pass in an array or list of strings the same length as `breaks`). To remove the labels altogether, though, you would have to use a theme, a topic we'll return to later.\n",
    "You can also use `breaks` and `labels` to control the appearance of legends.\n",
    "For discrete scales for categorical variables, `labels` can be a list of the desired labels, given in the same order as the existing levels.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a852304",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(mpg, aes(x=\"displ\", y=\"hwy\", color=\"drv\"))\n",
    "    + geom_point()\n",
    "    + scale_color_discrete(labels=[\"4-wheel\", \"front\", \"rear\"])\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "184dfb30",
   "metadata": {},
   "source": [
    "To change the formatting of the tick labels, use the `format=` keyword argument. This is useful to render currencies, percentages, and so on—though it's often easier for the reader to just see this symbol once in the axis label.\n",
    "\n",
    "In the example below, we read in the `diamonds` dataset and then format the price axis with `format=\"$.2s\"`. Let's break this down:\n",
    "\n",
    "- the dollar sign says to put a dollar sign in front of every number\n",
    "- the `.2` says to use two significant digits\n",
    "- the `s` says to use Système International (SI) prefixes, such as k for thousands\n",
    "\n",
    "There is a wealth of alternative options for formatting; the [helpful page on formatting](https://lets-plot.org/pages/formats.html) in the **lets-plot** documentation is the best place to find out more."
   ]
  },
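  {
   "cell_type": "markdown",
   "id": "e3f05a6d",
   "metadata": {},
   "source": [
    "To build intuition for what a format string like `\"$.2s\"` asks for, here is a rough pure-Python imitation. The helper `si_format()` is hypothetical and written only for illustration; it is not part of **lets-plot**, whose own formatter may differ in edge cases.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "\n",
    "def si_format(value, sig_digits=2, prefix=\"$\"):\n",
    "    \"\"\"Hypothetical helper: prefix + significant digits + SI suffix.\"\"\"\n",
    "    if value == 0:\n",
    "        return f\"{prefix}0\"\n",
    "    suffixes = {0: \"\", 3: \"k\", 6: \"M\", 9: \"G\"}\n",
    "    # Largest multiple of 3 not above the order of magnitude, clamped to 0..9\n",
    "    exponent = 3 * math.floor(math.log10(abs(value)) / 3)\n",
    "    exponent = max(0, min(exponent, 9))\n",
    "    scaled = value / 10**exponent\n",
    "    # Digits before the decimal point decide how many decimals remain\n",
    "    int_digits = max(math.floor(math.log10(abs(scaled))) + 1, 1)\n",
    "    decimals = max(sig_digits - int_digits, 0)\n",
    "    return f\"{prefix}{scaled:.{decimals}f}{suffixes[exponent]}\"\n",
    "\n",
    "\n",
    "for price in [6000, 12000, 18000]:\n",
    "    print(si_format(price))\n",
    "```\n",
    "\n",
    "With two significant digits, this helper keeps one decimal place for 6000 and none for 12000 or 18000, and each value gets a `k` suffix in place of its thousands."
   ]
  },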
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "40ac230e",
   "metadata": {},
   "outputs": [],
   "source": [
    "diamonds = pd.read_csv(\n",
    "    \"https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/diamonds.csv\",\n",
    "    index_col=0,\n",
    ")\n",
    "diamonds[\"cut\"] = diamonds[\"cut\"].astype(\n",
    "    pd.CategoricalDtype(\n",
    "        categories=[\"Fair\", \"Good\", \"Very Good\", \"Premium\", \"Ideal\"], ordered=True\n",
    "    )\n",
    ")\n",
    "diamonds[\"color\"] = diamonds[\"color\"].astype(\n",
    "    pd.CategoricalDtype(categories=[\"D\", \"E\", \"F\", \"G\", \"H\", \"I\", \"J\"], ordered=True)\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1520bb3c",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(diamonds, aes(x=\"cut\", y=\"price\"))\n",
    "    + geom_boxplot()\n",
    "    + coord_flip()\n",
    "    + scale_y_continuous(format=\"$.2s\", breaks=np.arange(0, 19000, step=6000))\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f2edc1b",
   "metadata": {},
   "source": [
    "Another use of breaks is when you have relatively few data points and want to highlight exactly where the observations occur. For example, take this plot that shows when each US president started and ended their term."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d1f993a",
   "metadata": {},
   "outputs": [],
   "source": [
    "presidential = pd.read_csv(\n",
    "    \"https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/presidential.csv\",\n",
    "    index_col=0,\n",
    ")\n",
    "presidential = presidential.astype({\"start\": \"datetime64[ns]\", \"end\": \"datetime64[ns]\"})\n",
    "presidential[\"id\"] = 33 + presidential.index\n",
    "presidential.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd2cc430",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(presidential, aes(x=\"start\", y=\"id\"))\n",
    "    + geom_point()\n",
    "    + geom_segment(aes(xend=\"end\", yend=\"id\"))\n",
    "    + scale_x_datetime()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b451c76",
   "metadata": {},
   "source": [
    "### Legend layout\n",
    "\n",
    "You will most often use `breaks` and `labels` to tweak the axes.\n",
    "While they both also work for legends, there are a few other techniques you are more likely to use.\n",
    "\n",
    "To control the overall position of the legend, you need to use a `theme()` setting.\n",
    "We'll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot.\n",
    "The theme setting `legend_position` controls where the legend is drawn, and to demonstrate this we'll use `gggrid()` to arrange all of the plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52d6e86a",
   "metadata": {},
   "outputs": [],
   "source": [
    "base = ggplot(mpg, aes(x=\"displ\", y=\"hwy\")) + geom_point(aes(color=\"class\"))\n",
    "\n",
    "p1 = base + theme(legend_position=\"right\")  # the default\n",
    "p2 = base + theme(legend_position=\"left\")\n",
    "p3 = base + theme(legend_position=\"top\") + guides(color=guide_legend(nrow=3))\n",
    "p4 = base + theme(legend_position=\"bottom\") + guides(color=guide_legend(nrow=3))\n",
    "\n",
    "gggrid([p1, p2, p3, p4], ncol=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ce2507b",
   "metadata": {},
   "source": [
    "If your plot is short and wide, place the legend at the top or bottom, and if it's tall and narrow, place the legend at the left or right. You can also use `legend_position = \"none\"` to suppress the display of the legend altogether.\n",
    "\n",
    "To control the display of individual legends, use `guides()` along with `guide_legend()` or `guide_colorbar()`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f27913c7",
   "metadata": {},
   "source": [
    "\n",
    "### Replacing a scale\n",
    "\n",
    "Instead of just tweaking the details a little, you can instead replace the scale altogether.\n",
    "There are two types of scales you're most likely to want to switch out: continuous position scales and colour scales.\n",
    "Fortunately, the same principles apply to all the other aesthetics, so once you've mastered position and colour, you'll be able to quickly pick up other scale replacements.\n",
    "\n",
    "It's very useful to plot transformations of your variable.\n",
    "For example, it's easier to see the precise relationship between `carat` and `price` if we log transform them. The way to do this is by using an `apply()` function on the data that gets sent to `ggplot`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c1d3f8d",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(\n",
    "        diamonds.apply({\"carat\": np.log10, \"price\": np.log10}),\n",
    "        aes(x=\"carat\", y=\"price\"),\n",
    "    )\n",
    "    + geom_bin2d()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f19dbbec",
   "metadata": {},
   "source": [
    "However, the disadvantage of this transformation is that the axes now show the log-transformed values under the original variable names, making it hard to interpret the plot. Instead of transforming the data before plotting, we can do the transformation with the scale. This is visually identical, except the axes are labelled on the original data scale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39b4ef8d",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(diamonds, aes(x=\"carat\", y=\"price\"))\n",
    "    + geom_bin2d()\n",
    "    + scale_x_log10()\n",
    "    + scale_y_log10()\n",
    ")"
   ]
  },
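  {
   "cell_type": "markdown",
   "id": "7d92c4b1",
   "metadata": {},
   "source": [
    "To see why the scale-based version is easier to read, compare the numbers an axis would show in each case. This is a small illustration with made-up prices, not the `diamonds` data:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "prices = np.array([100, 1000, 10000])  # made-up values for illustration\n",
    "\n",
    "# Transforming the data first: the axis would show the exponents 2, 3, 4\n",
    "log_prices = np.log10(prices)\n",
    "print(log_prices)\n",
    "\n",
    "# Each tenfold increase takes up the same distance on a log axis,\n",
    "# whichever way the transformation is applied\n",
    "print(np.diff(log_prices))\n",
    "```\n",
    "\n",
    "A log scale positions the points in the same way, but keeps tick labels in the original units (100, 1000, 10000), so the reader does not have to undo the logarithm mentally."
   ]
  },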
  {
   "cell_type": "markdown",
   "id": "4402c4de",
   "metadata": {},
   "source": [
    "Another scale that is frequently customised is colour. The default categorical scale picks colours that are evenly spaced around the colour wheel. Useful alternatives are the ColorBrewer scales, which have been hand tuned to work better for people with common types of colour blindness. The two plots below look similar, but there is enough difference in the shades of red and green that the dots in the second plot can be distinguished even by people with red-green colour blindness."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f06d7e40",
   "metadata": {},
   "outputs": [],
   "source": [
    "(ggplot(mpg, aes(x=\"displ\", y=\"hwy\")) + geom_point(aes(color=\"drv\")))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6186b520",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(mpg, aes(x=\"displ\", y=\"hwy\"))\n",
    "    + geom_point(aes(color=\"drv\"))\n",
    "    + scale_color_brewer(palette=\"Set1\")\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f19af6ad",
   "metadata": {},
   "source": [
    "Don't forget simpler techniques for improving accessibility.\n",
    "If there are just a few colors, you can add a redundant shape mapping.\n",
    "This will also help ensure your plot is interpretable in black and white."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "253af5a4",
   "metadata": {},
   "source": [
    "The ColorBrewer scales are documented online at <https://colorbrewer2.org/>. The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a \"middle\". This often arises if you've used `pd.cut()` to make a continuous variable into a categorical variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd347524",
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [],
   "source": [
    "# remove-input\n",
    "cmaps = [\n",
    "    (\n",
    "        \"Perceptually Uniform Sequential\",\n",
    "        [\"viridis\", \"plasma\", \"inferno\", \"magma\", \"cividis\"],\n",
    "    ),\n",
    "    (\n",
    "        \"Sequential\",\n",
    "        [\n",
    "            \"Blues\",\n",
    "            \"BuGn\",\n",
    "            \"BuPu\",\n",
    "            \"GnBu\",\n",
    "            \"Greens\",\n",
    "            \"Greys\",\n",
    "            \"Oranges\",\n",
    "            \"OrRd\",\n",
    "            \"PuBu\",\n",
    "            \"PuBuGn\",\n",
    "            \"PuRd\",\n",
    "            \"Purples\",\n",
    "            \"RdPu\",\n",
    "            \"Reds\",\n",
    "            \"YlGn\",\n",
    "            \"YlGnBu\",\n",
    "            \"YlOrBr\",\n",
    "            \"YlOrRd\",\n",
    "        ],\n",
    "    ),\n",
    "    (\n",
    "        \"Diverging\",\n",
    "        [\n",
    "            \"BrBG\",\n",
    "            \"PiYG\",\n",
    "            \"PRGn\",\n",
    "            \"PuOr\",\n",
    "            \"RdBu\",\n",
    "            \"RdGy\",\n",
    "            \"RdYlBu\",\n",
    "            \"RdYlGn\",\n",
    "        ],\n",
    "    ),\n",
    "    (\n",
    "        \"Qualitative\",\n",
    "        [\n",
    "            \"Pastel1\",\n",
    "            \"Pastel2\",\n",
    "            \"Paired\",\n",
    "            \"Accent\",\n",
    "            \"Dark2\",\n",
    "            \"Set1\",\n",
    "            \"Set2\",\n",
    "            \"Set3\",\n",
    "            \"tab10\",\n",
    "            \"tab20\",\n",
    "            \"tab20b\",\n",
    "            \"tab20c\",\n",
    "        ],\n",
    "    ),\n",
    "]\n",
    "\n",
    "\n",
    "gradient = np.linspace(0, 1, 256)\n",
    "gradient = np.vstack((gradient, gradient))\n",
    "\n",
    "\n",
    "def plot_color_gradients(cmap_category, cmap_list):\n",
    "    # Create figure and adjust figure height to number of colourmaps\n",
    "    nrows = len(cmap_list)\n",
    "    figh = 0.35 + 0.15 + (nrows + (nrows - 1) * 0.1) * 0.22\n",
    "    fig, axs = plt.subplots(nrows=nrows, figsize=(6.4, figh))\n",
    "    fig.subplots_adjust(top=1 - 0.35 / figh, bottom=0.15 / figh, left=0.2, right=0.99)\n",
    "\n",
    "    axs[0].set_title(cmap_category + \" colormaps\", fontsize=14)\n",
    "\n",
    "    for ax, name in zip(axs, cmap_list):\n",
    "        ax.imshow(gradient, aspect=\"auto\", cmap=plt.get_cmap(name))\n",
    "        ax.text(\n",
    "            -0.01,\n",
    "            0.5,\n",
    "            name,\n",
    "            va=\"center\",\n",
    "            ha=\"right\",\n",
    "            fontsize=10,\n",
    "            transform=ax.transAxes,\n",
    "        )\n",
    "\n",
    "    # Turn off *all* ticks & spines, not just the ones with colourmaps.\n",
    "    for ax in axs:\n",
    "        ax.set_axis_off()\n",
    "\n",
    "\n",
    "for cmap_category, cmap_list in cmaps[1:2]:\n",
    "    plot_color_gradients(cmap_category, cmap_list)\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6350c71",
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [],
   "source": [
    "# remove input\n",
    "for cmap_category, cmap_list in cmaps[3:4]:\n",
    "    plot_color_gradients(cmap_category, cmap_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0063a574",
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [],
   "source": [
    "# remove input\n",
    "for cmap_category, cmap_list in cmaps[2:3]:\n",
    "    plot_color_gradients(cmap_category, cmap_list)"
   ]
  },
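  {
   "cell_type": "markdown",
   "id": "a85f1e3c",
   "metadata": {},
   "source": [
    "As noted above, ordered categories often come from binning a continuous variable with `pd.cut()`. A minimal sketch, assuming made-up fuel economy values and arbitrary bin edges chosen only for illustration:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Made-up highway fuel economy values, for illustration only\n",
    "hwy = pd.Series([12, 18, 24, 29, 35, 44])\n",
    "\n",
    "# Bin into three ordered categories; bins are right-inclusive by default\n",
    "hwy_band = pd.cut(hwy, bins=[0, 20, 30, 50], labels=[\"low\", \"medium\", \"high\"])\n",
    "print(hwy_band.tolist())\n",
    "print(hwy_band.cat.ordered)\n",
    "```\n",
    "\n",
    "Because the resulting categories are ordered, a sequential palette is a natural fit for a variable like this."
   ]
  },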
  {
   "cell_type": "markdown",
   "id": "c32c2237",
   "metadata": {},
   "source": [
    "When you have a predefined mapping between values and colours, use `scale_color_manual()`. For example, if we map presidential party to colour, we want to use the standard mapping of red for Republicans and blue for Democrats. One approach for assigning these colors is using hex colour codes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9751058d",
   "metadata": {},
   "outputs": [],
   "source": [
    "mini_presid = presidential.iloc[5:, :]\n",
    "\n",
    "(\n",
    "    ggplot(mini_presid, aes(x=\"start\", y=\"id\", color=\"party\"))\n",
    "    + geom_point(size=3)\n",
    "    + geom_segment(aes(xend=\"end\", yend=\"id\"), size=1)\n",
    "    + scale_x_datetime(breaks=mini_presid[\"start\"], format=\"%Y\")\n",
    "    + scale_color_manual(values=[\"#00AEF3\", \"#E81B23\"], name=\"party\")\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6be370b4",
   "metadata": {},
   "source": [
    "You can also use typical colour names such as \"red\" and \"blue\".\n",
    "\n",
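    "As a minimal sketch of the named-colour variant: the small `toy` data frame below is made up purely for illustration (it is not the `presidential` data), and the snippet assumes that **lets-plot** pairs `values` with `limits` in order, so that each party gets its intended colour regardless of the order in which the categories appear in the data.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "from lets_plot import *\n",
    "\n",
    "# Toy data for illustration only (not the presidential dataset).\n",
    "toy = pd.DataFrame(\n",
    "    {\n",
    "        \"x\": [1, 2, 3, 4],\n",
    "        \"y\": [1, 2, 3, 4],\n",
    "        \"party\": [\"Republican\", \"Democratic\", \"Republican\", \"Democratic\"],\n",
    "    }\n",
    ")\n",
    "\n",
    "# Assumption: colours in `values` are matched, in order, to the categories in `limits`.\n",
    "party_order = [\"Democratic\", \"Republican\"]\n",
    "party_colours = [\"blue\", \"red\"]\n",
    "\n",
    "p_named = (\n",
    "    ggplot(toy, aes(x=\"x\", y=\"y\", color=\"party\"))\n",
    "    + geom_point(size=5)\n",
    "    + scale_color_manual(values=party_colours, limits=party_order, name=\"party\")\n",
    ")\n",
    "```\n",
    "\n",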
    "For continuous colour, you can use the built-in `scale_color_gradient()` or `scale_fill_gradient()`.\n",
    "If you have a diverging scale, you can use `scale_color_gradient2()`. That allows you to give, for example, positive and negative values different colors. That's sometimes also useful if you want to distinguish points above or below the mean.\n",
    "\n",
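    "To make the diverging case concrete, here is a hedged sketch that colours points by how far they sit above or below the mean of `y`. The data are random and purely illustrative, and the column name `y_centred` and the particular colours are choices made for this example only.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from lets_plot import *\n",
    "\n",
    "# Random illustrative data; the extra column holds each point's deviation from the mean of y.\n",
    "rng = np.random.default_rng(42)\n",
    "df_div = pd.DataFrame({\"x\": rng.standard_normal(200), \"y\": rng.standard_normal(200)})\n",
    "df_div[\"y_centred\"] = df_div[\"y\"] - df_div[\"y\"].mean()\n",
    "\n",
    "# Values below the midpoint shade towards blue, values above it towards red.\n",
    "p_div = (\n",
    "    ggplot(df_div, aes(x=\"x\", y=\"y\", color=\"y_centred\"))\n",
    "    + geom_point()\n",
    "    + scale_color_gradient2(low=\"blue\", mid=\"#dddddd\", high=\"red\", midpoint=0)\n",
    ")\n",
    "```\n",
    "\n",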
    "Another option is to use the viridis, magma, inferno, and plasma color scales developed for the extremely powerful imperative Python plotting package **[matplotlib](https://matplotlib.org/)**. The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous color schemes that are perceptible to people with various forms of color blindness as well as perceptually uniform in both color and black and white. These scales are available as palettes in **lets-plot**. Here's an example using the plasma option of the continuous viridis scale (we'll generate some random data first):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "644fd814",
   "metadata": {},
   "outputs": [],
   "source": [
    "prng = np.random.default_rng(1837)  # prng = pseudo-random number generator\n",
    "df_rnd = pd.DataFrame(prng.standard_normal((1000, 2)), columns=[\"x\", \"y\"])\n",
    "(\n",
    "    ggplot(df_rnd, aes(x=\"x\", y=\"y\"))\n",
    "    + geom_bin2d()\n",
    "    + coord_fixed()\n",
    "    + scale_fill_viridis(option=\"plasma\")\n",
    "    + labs(title=\"Plasma, continuous\")\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7cf0931",
   "metadata": {},
   "source": [
    "### Zooming\n",
    "\n",
    "There are three ways to control the plot limits:\n",
    "\n",
    "1.  Adjusting what data are plotted.\n",
    "2.  Setting the limits in each scale.\n",
    "3.  Setting `xlim` and `ylim` in `coord_cartesian()`.\n",
    "\n",
    "We'll demonstrate these options in a series of plots.\n",
    "The first plot shows the relationship between engine size and fuel efficiency, coloured by type of drive train.\n",
    "The second plot shows the same variables, but subsets the data that are plotted.\n",
    "Subsetting the data has affected the x and y scales as well as the smooth curve.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25a29f38",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(mpg, aes(x=\"displ\", y=\"hwy\"))\n",
    "    + geom_point(aes(color=\"drv\"))\n",
    "    + geom_smooth(method=\"loess\")\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42318a59",
   "metadata": {},
   "outputs": [],
   "source": [
    "mpg_condition = (\n",
    "    (mpg[\"displ\"] >= 5) & (mpg[\"displ\"] <= 6) & (mpg[\"hwy\"] >= 10) & (mpg[\"hwy\"] <= 25)\n",
    ")\n",
    "\n",
    "(\n",
    "    ggplot(mpg.loc[mpg_condition], aes(x=\"displ\", y=\"hwy\"))\n",
    "    + geom_point(aes(color=\"drv\"))\n",
    "    + geom_smooth(method=\"loess\")\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec4c07d0",
   "metadata": {},
   "source": [
    "Let's compare these to the two plots below, where the first plot sets the `limits` on individual scales and the second plot sets them in `coord_cartesian()`.\n",
    "We can see that reducing the limits on the scales is equivalent to subsetting the data: the smooth curve changes because it is fit only to the points that remain.\n",
    "Therefore, to zoom in on a region of the plot without changing what is plotted, it's generally best to use `coord_cartesian()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "03001d5e",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(mpg, aes(x=\"displ\", y=\"hwy\"))\n",
    "    + geom_point(aes(color=\"drv\"))\n",
    "    + geom_smooth(method=\"loess\")\n",
    "    + scale_x_continuous(limits=(5, 6))\n",
    "    + scale_y_continuous(limits=(10, 25))\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc3bb833",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(mpg, aes(x=\"displ\", y=\"hwy\"))\n",
    "    + geom_point(aes(color=\"drv\"))\n",
    "    + geom_smooth(method=\"loess\")\n",
    "    + coord_cartesian(xlim=(5, 6), ylim=(10, 25))\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d1fc3ee",
   "metadata": {},
   "source": [
    "On the other hand, setting the `limits` on individual scales is generally more useful if you want to *expand* the limits, e.g., to match scales across different plots.\n",
    "For example, if we extract two classes of cars and plot them separately, it's difficult to compare the plots because all three scales (the x-axis, the y-axis, and the colour aesthetic) have different ranges."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aee538a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "suv = mpg.loc[mpg[\"class\"] == \"suv\"]\n",
    "compact = mpg.loc[mpg[\"class\"] == \"compact\"]\n",
    "(ggplot(suv, aes(x=\"displ\", y=\"hwy\", color=\"drv\")) + geom_point())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a82c8c23",
   "metadata": {},
   "outputs": [],
   "source": [
    "(ggplot(compact, aes(x=\"displ\", y=\"hwy\", color=\"drv\")) + geom_point())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be777179",
   "metadata": {},
   "source": [
    "One way to overcome this problem is to share scales across multiple plots, training the scales with the `limits` of the full data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db6fce43",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Limits are given as [lower, upper], so take the minimum first and then the maximum.\n",
    "x_scale = scale_x_continuous(limits=mpg[\"displ\"].agg([\"min\", \"max\"]).tolist())\n",
    "y_scale = scale_y_continuous(limits=mpg[\"hwy\"].agg([\"min\", \"max\"]).tolist())\n",
    "col_scale = scale_color_discrete(limits=mpg[\"drv\"].unique().tolist())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd9e6606",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(suv, aes(x=\"displ\", y=\"hwy\", color=\"drv\"))\n",
    "    + geom_point()\n",
    "    + x_scale\n",
    "    + y_scale\n",
    "    + col_scale\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bdd8b2c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(compact, aes(x=\"displ\", y=\"hwy\", color=\"drv\"))\n",
    "    + geom_point()\n",
    "    + x_scale\n",
    "    + y_scale\n",
    "    + col_scale\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "577d8648",
   "metadata": {},
   "source": [
    "In this particular case, you could have simply used faceting, but this technique is useful more generally, for instance if you want to spread plots over multiple pages of a report.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4094830b",
   "metadata": {},
   "source": [
    "### Exercises\n",
    "\n",
    "1.  What is the first argument to every scale?\n",
    "    How does it compare to `labs()`?\n",
    "\n",
    "2.  Change the display of the presidential terms by:\n",
    "\n",
    "    a.  Combining the two variants that customize colors and x axis breaks.\n",
    "    b.  Improving the display of the y axis.\n",
    "    c.  Labelling each term with the name of the president.\n",
    "    d.  Adding informative plot labels.\n",
    "    e.  Placing breaks every 4 years (this is trickier than it seems!).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b574471",
   "metadata": {},
   "source": [
    "## Themes\n",
    "\n",
    "Finally, you can customise the non-data elements of your plot with a theme:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0b2364ca",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(mpg, aes(x=\"displ\", y=\"hwy\"))\n",
    "    + geom_point(aes(color=\"class\"))\n",
    "    + geom_smooth(se=False)\n",
    "    + theme_grey()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7814bb4d",
   "metadata": {},
   "source": [
    "**lets-plot** includes several built-in themes that you can find [here](https://lets-plot.org/pages/api.html#predefined-themes). You can also create your own themes, if you are trying to match a particular corporate or journal style.\n",
    "\n",
    "Here's an example of changing multiple `theme()` settings:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67bfa9c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(mpg, aes(x=\"displ\", color=\"drv\"))\n",
    "    + geom_density(size=2)\n",
    "    + ggtitle(\"Density of drives\")\n",
    "    + theme(\n",
    "        axis_line=element_line(size=4),\n",
    "        axis_ticks_length=10,\n",
    "        axis_title_y=\"blank\",\n",
    "        legend_position=[1, 1],\n",
    "        legend_justification=[1, 1],\n",
    "        panel_background=element_rect(color=\"black\", fill=\"#eeeeee\", size=2),\n",
    "        panel_grid=element_line(color=\"black\", size=1),\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b05b5da",
   "metadata": {},
   "source": [
    "### Exercises\n",
    "\n",
    "1.  Make the axis labels of your plot blue and bolded.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a56216db",
   "metadata": {},
   "source": [
    "## Layout\n",
    "\n",
    "So far we've talked about how to create and modify a single plot.\n",
    "What if you have multiple plots that you want to lay out in a certain way? To place two plots next to each other, put them in a list and call `gggrid()` on the list. Note that you first need to create the plots and save them as objects (in the following example they're called `p1` and `p2`).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8081df4",
   "metadata": {},
   "outputs": [],
   "source": [
    "p1 = ggplot(mpg, aes(x=\"displ\", y=\"hwy\")) + geom_point() + labs(title=\"Plot 1\")\n",
    "p2 = ggplot(mpg, aes(x=\"drv\", y=\"hwy\")) + geom_boxplot() + labs(title=\"Plot 2\")\n",
    "gggrid([p1, p2])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0773270",
   "metadata": {},
   "source": [
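    "The number of columns in the grid can also be set explicitly. As a sketch, assuming the `ncol` argument of `gggrid()` behaves as in recent **lets-plot** releases, the following stacks two plots vertically. The small `toy` data frame and the names `top`, `bottom`, and `stacked` are made up purely for illustration.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "from lets_plot import *\n",
    "\n",
    "# Toy data for illustration only.\n",
    "toy = pd.DataFrame({\"x\": [1, 2, 3, 4], \"y\": [2, 4, 3, 5], \"g\": [\"a\", \"a\", \"b\", \"b\"]})\n",
    "\n",
    "top = ggplot(toy, aes(x=\"x\", y=\"y\")) + geom_point() + labs(title=\"Top\")\n",
    "bottom = ggplot(toy, aes(x=\"g\", y=\"y\")) + geom_boxplot() + labs(title=\"Bottom\")\n",
    "\n",
    "# With a single column, each plot goes on its own row.\n",
    "stacked = gggrid([top, bottom], ncol=1)\n",
    "```\n",
    "\n",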
    "## Saving plots to file\n",
    "\n",
    "There are lots of output formats to choose from when saving a plot to file. Remember that, for graphics, *vector formats* are generally better than *raster formats*. In practice, this means saving plots in svg or pdf formats over jpg or png file formats. The svg format works in a lot of contexts (including Microsoft Word) and is a good default. To choose between formats, just supply the file extension and the file type will change automatically, e.g. \"chart.svg\" for svg or \"chart.png\" for png (though note that raster formats often have extra options, like how many dots per inch to use).\n",
    "\n",
    "Let's try this out using the figure we made in the previous section, `p1`. `path=\".\"` just drops the file in the current directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "710a6a4f",
   "metadata": {},
   "outputs": [],
   "source": [
    "ggsave(p1, \"chart.svg\", path=\".\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7781794a",
   "metadata": {},
   "source": [
    "To double check this has worked, let's use the terminal. We'll try the command `ls`, which lists everything in the current directory, and `grep -F .svg` to pull out any file names containing `.svg` from what is returned by `ls`. The two commands are strung together by a `|`. (Note that the leading exclamation mark below just tells the software that builds this book to use the terminal.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc831b1b",
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls | grep -F .svg"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9cc10ab7",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "# remove-cell\n",
    "import os\n",
    "\n",
    "os.remove(\"chart.svg\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "793f4a04",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this chapter you've learned about adding plot labels such as title, subtitle, caption as well as modifying default axis labels, using annotation to add informational text to your plot or to highlight specific data points, customising the axis scales, and changing the theme of your plot.\n",
    "You've also learned about combining multiple plots in a single graph using both simple and complex plot layouts.\n",
    "\n",
    "While you've so far learned about how to make many different types of plots and how to customise them using a variety of techniques, we've barely scratched the surface of what you can create with **lets-plot**.\n",
    "\n",
    "The best place to go for further information is the [**lets-plot** documentation](https://lets-plot.org/)."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "encoding": "# -*- coding: utf-8 -*-",
   "formats": "md:myst",
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  },
  "toc-showtags": true
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
